Searching DNA databases for similarities to DNA sequences: when is a match significant?

Authors

  • I. Anderson
  • Andy Brass
Abstract

MOTIVATION: Searching DNA sequences against a DNA database is an essential element of sequence analysis. However, few systematic studies have been carried out to determine when a match between two DNA sequences has biological significance, and this is limiting the use that can be made of DNA searching algorithms.

RESULTS: A test set of DNA sequences has been constructed consisting of artificially evolved and real sequences. This set has been used to test various database searching algorithms (BLAST, BLAST2, FASTA and Smith-Waterman) on a subset of the EMBL database. The results of this analysis have been used to determine the sensitivity and coverage of all of the algorithms. Guidelines have been produced which can be used to assess the significance of DNA database search results. The Smith-Waterman algorithm was shown to have the best coverage, but the worst sensitivity, whereas the default BLASTN algorithm (word length set to 11) was shown to have good sensitivity, but poor coverage. A sensible compromise between speed, sensitivity and coverage can be obtained using either the FASTA or BLAST (word length set to 6) algorithms. However, analysis of the results also showed that no algorithm works well when the length of the probe sequence is <200 bases. In general, matches can accurately be identified between coding regions of DNA sequences when there is >35% sequence identity between the corresponding proteins. Searching a DNA sequence against a DNA sequence database can, therefore, be a useful tool in sequence analysis.

AVAILABILITY: The test sets used are available via anonymous ftp from mbisg2.sbc.man.ac.uk in the directory /pub/cabios/testdata/

CONTACT: [email protected]; [email protected]
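The effect of word length mentioned in the abstract can be illustrated with a small sketch. Seed-based tools only examine database sequences that share at least one exact word of the chosen length with the probe, so a longer word tends to miss more diverged matches. The code below is not the BLAST algorithm itself (it omits extension, scoring and statistics); the function names and the two 20-base sequences are hypothetical, chosen only to show a pair that shares a 6-base word but no 11-base word.

```python
def kmers(seq: str, w: int) -> set[str]:
    """Return the set of all exact words of length w found in seq."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def shares_exact_word(query: str, subject: str, w: int) -> bool:
    """True if query and subject have at least one exact word of length w in common.

    This mimics only the seeding step of a word-based search; it is an
    illustrative assumption, not the implementation of any real tool.
    """
    return bool(kmers(query, w) & kmers(subject, w))


# Hypothetical sequences: identical except for substitutions at two positions,
# which breaks every window of 11 consecutive matching bases.
query = "ACGTACGGTCAATGCCTAGG"
subject = "ACGTACCGTCAATGGCTAGG"

seed_found_w11 = shares_exact_word(query, subject, 11)
seed_found_w6 = shares_exact_word(query, subject, 6)
print("seed with word length 11:", seed_found_w11)
print("seed with word length 6:", seed_found_w6)
```

With these example sequences, a word length of 11 finds no seed while a word length of 6 does, which is one way to picture why a shorter word length can recover matches that the default setting skips, at the cost of more candidate hits to evaluate.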

Similar resources

Searching the genome of beluga (Huso huso) for sex markers based on targeted Bulked Segregant Analysis (BSA)

In sturgeon aquaculture, where the main purpose is caviar production, a reliable method is needed to separate fish according to gender. Currently, due to the lack of external sexual dimorphism, the fish are sexed by an invasive surgical examination of the gonads. Development of a non-invasive procedure for sexing fish based on genetic markers is of special interest. In the present study we empl...

Research Article: Molecular genetic divergence of five genera of cypriniform fish in Iran assessed by DNA barcoding

The present study represents a comprehensive molecular assessment of some families of freshwater fishes in Iran. We analyzed cytochrome oxidase I (COI) sequences for five genera of cypriniform fishes from Iran. The present investigation provides data on the genetic structure of some species of Nemachilidae including Paraschistura bampurensis, Oxynoemacheilus kiabii and Turcinemacheilus saadii and Leuc...

Fast search in DNA sequence databases using punctuation and indexing

Exact pattern searching in DNA sequence databases has applications in identification of highly conserved regulatory sequences, the design of hybridization probes, and improving performance of approximate homology searching tools such as BLAST and BLAT. We propose a new pattern searching algorithm, CompressedPunctuated-Boyer-Moore (cp-BM), to enhance exact pattern match searches of DNA sequences...

gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences

Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...


Journal: Bioinformatics

Volume 14, Issue 4

Publication date: 1998